The Age Friendly Communities (AFC) project requires several key datasets, pulled from a variety of data sources, to answer its data questions. This document is intended as a guide to the aggregated AFC dataset and as a record of the analysis and visualizations applied to it in answering those questions.
The exploratory analysis detailed in the rest of the document attempts to answer the questions listed under Data Questions. These questions and additional contextual information can be found in this document. Details of the analysis, along with visualizations pertaining to each data question, are under Analysis. A summary of the insights drawn from the analysis, including caveats (if any) and notes on additional work still to be carried out, can be found under Conclusions.
A. RCFE Capacity
B. 65+ Population
C. San Diego’s Alzheimer’s Population
D. Low-Income Seniors
E. Race
In [13]:
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
from IPython.display import display, HTML
# GLOBALS
# working directory
CWD = os.getcwd()
# Data File(s)
VERSION = "20170321"
DATAFILE = "afc_" + VERSION + ".csv"
datapath = os.path.join(CWD,DATAFILE)
# read the data file into a data frame
df = pd.read_csv(datapath)
The AFC dataset was generated using data (wholly or partially) from a number of disparate data sources, aggregated where possible along the following geographical IDs specific to San Diego County:
For a list of the individual data sources that contributed to the aggregated AFC dataset, please see this resource.
In [14]:
# a custom definition of an info method callable on a series
def sinfo(x):
    # returns a one-line summary: inferred dtype, null count, length, unique count
    xu = x.unique()
    l = len(x)
    typ = str(np.array(xu.tolist()).dtype) + " "
    nans = "null: " + str(l - x.count()) + " "
    lgt = "len: " + str(l) + " "
    unq = "unq: " + str(len(xu))
    return (typ + nans + lgt + unq)

# function to output summary information (dimensions, data types among others) for
# the specified dataframe
def summarize(df):
    # output the number of rows and cols of the data frame
    # (single-argument print() behaves the same in Python 2.7 and Python 3)
    print("Dimensions: {}".format(df.shape))
    # print the variables, the number of unique values and a list of unique values for each col
    print("{}".format(df.apply(lambda x: (sinfo(x), x.unique()))))

summarize(df)
df.head()
Out[14]:
NOTES
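The per-column summary that `sinfo` assembles as a string can also be built as a small table with pandas built-ins, which is easier to sort or filter. The following is a minimal sketch, not the notebook's own method: `column_summary` is a hypothetical helper name, and the `toy` frame holds made-up rows rather than AFC data.

```python
import numpy as np
import pandas as pd


def column_summary(frame):
    """Return one row per column: dtype, null count, length, unique count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "null": frame.isnull().sum(),
        "len": len(frame),
        # len(col.unique()) counts NaN as a value, matching sinfo's "unq"
        "unq": frame.apply(lambda col: len(col.unique())),
    })


# Made-up rows for illustration only; not taken from the AFC dataset.
toy = pd.DataFrame({
    "Region": ["North", "North", "South"],
    "Zipcode": [0, 92101, 0],
    "NumRCFELicensed": [5.0, np.nan, 7.0],
})
summary = column_summary(toy)
print(summary)
```

Unlike `sinfo`, this reports the dtype pandas stores for each column rather than the dtype NumPy infers from the unique values, so the two can differ for mixed or object columns.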
In [15]:
# group by Region
df_g_region = df.groupby('Region')
# compute counts for key indicators for each group
indices = []
cols = ['NumRCFELicensed','NumRCFEBedsLicensed','NumRCFEInALWP','2012Pop65Over','2030Pop65Over',
        '2012PopMinority','2012MedianHHIncome','2012MedianHHIncome65Over','2012PopLowIncome65Over']
# create a 2-D array to hold information for each group
dim = (len(df_g_region.groups), len(cols))
# the builtin float replaces np.float, which newer NumPy releases no longer provide
data = np.zeros(dim, dtype=float)
pos = 0
for region, group in df_g_region:
    indices.append(region)
    # keep only the rows whose Zipcode is 0 and sum the indicators over them
    sra_data = group[group['Zipcode'] == 0]
    sra_data_sum = sra_data[cols].sum(axis=0).to_frame().T
    data[pos] = sra_data_sum.values
    pos = pos + 1
df_region = pd.DataFrame(data=data, columns=cols, index=indices)
df_region = df_region.astype(int)
display(df_region)
# convert population and income data to kilo units
df_region[cols[3:]] = df_region[cols[3:]] / 1000.0
# express the licensed bed count in hundreds
df_region['NumRCFEBedsLicensed'] = df_region['NumRCFEBedsLicensed'].apply(lambda x: int(x/100))
# transpose so that each region becomes a column for visualization
df_region = df_region.T
region_list = df_region.columns.tolist()
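The loop above filters each region's rows to `Zipcode == 0` and sums the indicator columns. The same aggregation can be sketched with a boolean filter followed by `groupby(...).sum()`. This is an illustrative alternative under stated assumptions, not the notebook's code: the `toy` frame contains made-up values, only two of the indicator columns are shown, and `by_region` is a hypothetical name.

```python
import pandas as pd

# Made-up rows for illustration only; the real values come from the AFC CSV.
toy = pd.DataFrame({
    "Region": ["Central", "Central", "Central", "East", "East"],
    "Zipcode": [0, 0, 92101, 0, 91901],
    "NumRCFELicensed": [3, 4, 2, 5, 1],
    "2012Pop65Over": [12000, 8000, 5000, 30000, 4000],
})
cols = ["NumRCFELicensed", "2012Pop65Over"]

# Keep only the rows with Zipcode == 0, as the loop does, then sum per region.
by_region = toy[toy["Zipcode"] == 0].groupby("Region")[cols].sum()

# Express the population column in thousands.
by_region["2012Pop65Over"] = by_region["2012Pop65Over"] / 1000.0
print(by_region)
```

One difference to keep in mind: a region with no `Zipcode == 0` rows is dropped from this result, whereas the loop leaves a row of zeros for it.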
In [12]:
# this configuration below prevents pylab from importing anything into the global namespace
# needed to prevent user warning about namespace clashes
#%config InteractiveShellApp.pylab_import_all = False
#%pylab inline
%matplotlib inline
plt.rcParams['font.size'] = 10
plt.rcParams['font.weight'] = 'bold'
# plot the above in a bar graph
width = 0.3
labels = ['NumRCFELicensed','NumRCFEBedsLicensed','NumRCFE (ALWP)','65+ (2012)','65+ (2030)','Minorities (2012)',
          'Med HH Income','Med HH Income (65+)','Low Income (65+)']
pos = np.arange(len(labels))
x_ticks_pos = pos + (0.15 * width)
num_regions = len(region_list)
# one subplot per region, stacked vertically
fig, axes = plt.subplots(num_regions, 1, figsize=(15, 25))
for i, region in enumerate(region_list):
    plot_vals = df_region[region]
    bars = axes[i].bar(pos, plot_vals, width, color='#00688B', alpha=0.5)
    for j, bar in enumerate(bars):
        val = plot_vals.iloc[j]
        ht = bar.get_height()
        bar_pos = bar.get_x() + bar.get_width()/2.
        ht = ((ht+1) * -50) if val < 0 else ht+10
        # population and income values (index 3 onwards) are shown in thousands
        txt = '{:.0f}K'.format(val) if j > 2 else '{:.0f}'.format(val)
        # the licensed bed count (index 1) is the series that was divided by 100
        txt = '{:.0f} (x 100)'.format(val) if j == 1 else txt
        axes[i].text(bar_pos, ht, txt, ha='center', va='bottom')
    axes[i].spines['top'].set_visible(False)
    axes[i].spines['right'].set_visible(False)
    axes[i].set_xlim(min(pos)-width, max(pos)+width)
    axes[i].set_ylim([0, max(plot_vals)])
    title = region
    axes[i].set_title(title, fontsize=14, fontweight='bold', color='#FF8C00')
    axes[i].title.set_position([.5, 1.05])
    #axes[i].grid(True)
    # only the bottom subplot carries the x tick labels
    if i == (num_regions-1):
        axes[i].set_xticks(x_ticks_pos)
        axes[i].set_xticklabels(labels, rotation=45, fontsize=11, fontweight='bold')
    else:
        axes[i].tick_params(axis='x', labelbottom=False)
    axes[i].margins(0.3, 0)
plt.show()
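The text drawn above each bar depends on the bar's position: plain counts for the RCFE indicators, an "(x 100)" suffix for the licensed bed count (the second entry in `cols`, which is divided by 100 before plotting), and a "K" suffix for the population and income values. A minimal sketch of that rule as a standalone function follows; `bar_label` and its parameter names are hypothetical and not part of the notebook.

```python
def bar_label(position, value, scaled_position=1, first_kilo_position=3):
    """Format the text drawn above a bar, based on the bar's position."""
    if position == scaled_position:
        # value was divided by 100 before plotting
        return "{:.0f} (x 100)".format(value)
    if position >= first_kilo_position:
        # value was divided by 1000 before plotting
        return "{:.0f}K".format(value)
    return "{:.0f}".format(value)


print(bar_label(0, 12))    # plain count
print(bar_label(1, 34))    # bed count, in hundreds
print(bar_label(5, 56.4))  # population or income, in thousands
```

Keeping the rule in one function makes the position-to-unit mapping explicit and lets it be checked without drawing a figure.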
Tested for Python versions: Python 2.7.12 :: Anaconda custom (64-bit)